The major objectives of this training will be to:
Understand the basics of the R programming language and its relevance to local government operations.
Learn how to manipulate and analyze data using R.
Improve proficiency in generating visualizations to communicate insights effectively.
Build confidence in using R as a routine tool, alongside familiar tools such as Excel and Word.
Develop the skills to undertake data-driven initiatives within the council.
R is a programming language commonly used for statistical computing and graphics. It is free, open-source, and provides a vast array of statistical and graphical techniques for data analysis and visualization. Users can perform data manipulation, modeling, and visualization tasks efficiently using R.
RStudio is an integrated development environment (IDE) for R programming. It provides a user-friendly interface for writing, executing, and debugging R code. RStudio includes features such as syntax highlighting, code completion, and built-in help documentation to support R programming. It also offers tools for managing projects, viewing data, and generating visualizations, making it a comprehensive environment for data analysis and R programming.
Within the local council, R can be used for a variety of tasks, such as managing payroll (calculating salaries and generating payroll reports), producing documents that combine text and visuals, and analyzing and visualizing council data to identify trends, patterns, and insights that can inform policy decisions.
R download
Visit https://www.r-project.org/ and click where the red arrow is pointing in the following images.
Note: macOS users also need to download Quartz at Quart.
RStudio download
Visit https://posit.co/ and follow the steps shown in the following images.
After downloading and installing R and RStudio, open RStudio. The following interface will be displayed.
The source editor is where you write and edit your R code. It provides features such as syntax highlighting (color-coding different elements of the code), automatic indentation, and code completion. You can open multiple script files in separate tabs for easier organization and navigation of your code.
The console is an interactive environment where R commands are executed and output is displayed. You can type commands directly into the console, and R will execute them immediately, showing the results. It also displays error messages, warnings, and other messages generated by R.
The environment pane displays information about the objects (variables, datasets, functions, etc.) currently in your R session. It shows the names, types, and values of objects, and allows you to interact with them (e.g., remove objects, import datasets). The environment pane helps you keep track of your workspace and manage objects effectively.
The history pane shows a history of the commands that have been executed in the console. It provides a record of previous commands, allowing you to review, re-execute, or modify them as needed. You can search the history by keyword to find specific commands quickly.
The files pane allows you to navigate the files and directories on your computer. You can browse, open, edit, and save files directly from within RStudio without using an external file explorer. It supports various file types, including R scripts, text files, CSV files, and more.
The plots pane displays graphical plots generated by R code. When you create plots using functions like plot() or ggplot(), they are shown in the plots pane. You can interact with plots (e.g., zooming, panning) and export them as images or PDF files.
The packages pane provides information about R packages that are installed on your system. It lists all installed packages, along with their version numbers and descriptions. You can install, update, and remove packages using buttons and commands in the packages pane.
The help pane provides access to R documentation and help files. You can search for specific functions, packages, or topics, and view detailed information about them. It includes descriptions, usage examples, arguments, and references to related functions or topics.
The viewer pane displays various types of content such as HTML files, images, and interactive visualizations. It is commonly used to view HTML reports generated from R Markdown documents, interactive plots, and other web-based content. The viewer pane allows for interactive exploration of output and results within RStudio.
Open Source: Freely available for download, use, and modification.
Extensive Statistical Capabilities: Comprehensive suite of statistical and graphical techniques.
Rich Visualization Tools: Packages like ggplot2 offer high-quality, customizable plots.
Active and Supportive Community: Large community provides support.
Reproducibility and Documentation: Facilitates reproducible research practices.
In R, objects are similar to containers used to store data, much like a box where you store your belongings. These objects can hold various types of data, such as numbers, text, or even more complex structures like data frames or lists. When you create an object in R, you give it a name of your choice, and you can then use that name to refer to the data stored within it.
To create an object in R, you use the assignment operator, which can be either <- or =, to assign a value or the result of an expression to a name. Here’s a simple example:
a <- 5 + 10
In this example, 5 + 10 is evaluated, and the result (15) is stored in an object named a.
Once you’ve created an object, you can access its contents by simply typing the name of the object. R will then display the value stored within that object. Additionally, you can use the print() function to explicitly display the contents of an object. For example:
a
-- [1] 15
print(a)
-- [1] 15
Both of these methods will show the value of the object a, which in this case is 15. You can find all the objects you have created listed in the environment pane.
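As a minimal sketch of the two assignment operators described above, the snippet below stores values under two illustrative names (total and rate are assumptions made here, not objects used elsewhere in this training) and then displays them. The ls() function is the console counterpart of the environment pane: it lists the names of the objects that currently exist.

```r
# Both assignment operators create an object; <- is the more common style
total <- 5 + 10
rate = 0.5

# Typing a name or calling print() displays the stored value
total
print(rate)

# ls() lists the names of the objects in the current workspace
ls()
```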
Data types are classifications that specify the nature of the data stored in objects. Understanding data types is crucial for effective data manipulation and analysis. Here’s a brief description of the main data types in R:
Numeric data types represent numerical values, including integers and real numbers (floating-point numbers).
# Below is a floating-point (double) numeric value
num1 <- 3.15
# Below is an integer value; the L suffix tells R to store it as an integer
num2 <- 2L
# You can use class() to check what type of data you have
class(num1)
-- [1] "numeric"
Character data types represent text data enclosed in quotation marks.
Logical data types represent binary values indicating true or false.
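The character and logical types described above can be illustrated with a short sketch. The values and object names (council_name, is_approved, budget_exceeded) are hypothetical and chosen only for illustration.

```r
# Character: text enclosed in quotation marks
council_name <- "Example City Council"  # hypothetical value
class(council_name)

# Logical: TRUE or FALSE, often the result of a comparison
is_approved <- TRUE
budget_exceeded <- 120 > 100
class(is_approved)
budget_exceeded

# A number wrapped in quotes is stored as text, not as a number
class("42")
```

The last line shows a common source of confusion when importing data: a column of numbers that arrives as text needs converting before any arithmetic can be done on it.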
Factor data types represent categorical data with predefined levels. Factors are created using the factor() function.
factor1 <- factor(c("low", "medium", "high"))
# To see which integer code is assigned to each level, you can use str()
str(factor1)
--  Factor w/ 3 levels "high","low","medium": 2 3 1
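By default R sorts factor levels alphabetically, which is why "high" comes first in the output above. A small sketch, assuming you want the levels in their natural order, passes the levels argument to factor(); the object names are illustrative only.

```r
# Without `levels`, the levels are ordered alphabetically: high, low, medium
priority_default <- factor(c("low", "medium", "high"))
levels(priority_default)

# Supplying `levels` sets a meaningful order: low, medium, high
priority_ordered <- factor(c("low", "medium", "high"),
                           levels = c("low", "medium", "high"))
levels(priority_ordered)

# The underlying integer codes follow the level order
as.integer(priority_ordered)
```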
Date and time data types represent specific points in time. Date objects are created using the as.Date() function, while date-time objects can be created using as.POSIXct() or as.POSIXlt().
date <- as.Date("2023-12-31")
time <- as.POSIXct("2023-12-31 12:00:00")
# Using str() to check the data type
str(date)
--  Date[1:1], format: "2023-12-31"
str(time)
--  POSIXct[1:1], format: "2023-12-31 12:00:00"
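Once values are stored as dates, R can do arithmetic with them. The sketch below uses two illustrative dates (assumed here, not taken from council data) to show a few common operations.

```r
start_date <- as.Date("2023-01-01")
end_date <- as.Date("2023-12-31")

# Subtracting two dates gives the number of days between them
days_between <- as.numeric(end_date - start_date)
days_between

# Adding a number to a date moves it forward by that many days
end_date + 1

# format() extracts parts of a date as text, here the year
format(end_date, "%Y")
```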
In R, objects can hold various types of data, allowing for flexible and versatile data storage and manipulation. There are five main types of objects:
Vectors
Matrices
Arrays
Lists
Data frames
A vector is the simplest type of R object. All of its elements must be of the same data type.
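To illustrate the single-type rule, here is a short sketch of what happens when different types are combined in one vector: R silently converts every element to a common type. The object names are illustrative.

```r
# Mixing a number, text, and a logical value converts everything to text
mixed <- c(1, "two", TRUE)
mixed
class(mixed)

# Mixing numbers and logical values converts TRUE/FALSE to 1/0
numbers_and_logicals <- c(1, TRUE, FALSE)
numbers_and_logicals
```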
# Numeric vector (whole numbers; R stores these as doubles unless written as 1L, 2L, ...)
vector1 <- c(1, 2, 3, 4, 5)
# Numeric vector (floating-point numbers)
vector2 <- c(1.43, 5.45, 6.98)
# Character vector
character_vector <- c("Diamond", "Gold", "Silver")
# Logical vector
logical_vector <- c(TRUE, FALSE, TRUE, FALSE)
# Factor vector representing categorical data
factor_vector <- factor(c("agree", "strongly agree", "disagree", "strongly disagree"))
A matrix is a two-dimensional array that stores data in rows and columns. It can be thought of as a vector with two dimensions, where all elements are of the same data type. Matrices are useful for organizing data in a tabular format, similar to a spreadsheet.
# Create a 3x3 matrix with numeric values
matrix1 <- matrix(data = c(1, 2, 3, 4, 5, 6, 7, 8, 9), nrow = 3, ncol = 3)
matrix1
--      [,1] [,2] [,3]
-- [1,]    1    4    7
-- [2,]    2    5    8
-- [3,]    3    6    9
# Create a 2x4 matrix with character values
matrix2 <- matrix(data = c("a", "b", "c", "d", "e", "f", "g", "h"), nrow = 2, ncol = 4)
matrix2
--      [,1] [,2] [,3] [,4]
-- [1,] "a"  "c"  "e"  "g"
-- [2,] "b"  "d"  "f"  "h"
# Create a 2x2 matrix with logical values
matrix3 <- matrix(data = c(TRUE, FALSE, FALSE, TRUE), nrow = 2, ncol = 2)
matrix3
--       [,1]  [,2]
-- [1,]  TRUE FALSE
-- [2,] FALSE  TRUE
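The outputs above show that matrix() fills values column by column. As a brief sketch, the byrow argument changes this to row by row, and square brackets pick out individual elements or whole rows. The object names are illustrative.

```r
# Default: values fill the matrix column by column
m_by_col <- matrix(1:6, nrow = 2, ncol = 3)
# byrow = TRUE: values fill the matrix row by row
m_by_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)

dim(m_by_col)    # number of rows, then number of columns

# Indexing is [row, column]
m_by_col[1, 2]
m_by_row[1, 2]

# Leaving the column blank returns the whole row
m_by_row[1, ]
```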
In R, an array is a multi-dimensional extension of a matrix, allowing for more than two dimensions. Like matrices, arrays store data in a structured format, but they can have multiple dimensions, making them suitable for representing higher-dimensional data.
# Create a 3-dimensional array with numeric values
array1 <- array(data = c(1, 2, 3, 4, 5, 6, 7, 8, 9), dim = c(3, 3, 2))
array1
-- , , 1
--
--      [,1] [,2] [,3]
-- [1,]    1    4    7
-- [2,]    2    5    8
-- [3,]    3    6    9
--
-- , , 2
--
--      [,1] [,2] [,3]
-- [1,]    1    4    7
-- [2,]    2    5    8
-- [3,]    3    6    9
# Create a 2-dimensional array with character values
array2 <- array(data = c("a", "b", "c", "d", "e", "f"), dim = c(2, 3))
array2
--      [,1] [,2] [,3]
-- [1,] "a"  "c"  "e"
-- [2,] "b"  "d"  "f"
# Create a 4-dimensional array with logical values
array3 <- array(data = c(TRUE, FALSE, TRUE, FALSE), dim = c(2, 2, 2, 2))
array3
-- , , 1, 1
--
--       [,1]  [,2]
-- [1,]  TRUE  TRUE
-- [2,] FALSE FALSE
--
-- , , 2, 1
--
--       [,1]  [,2]
-- [1,]  TRUE  TRUE
-- [2,] FALSE FALSE
--
-- , , 1, 2
--
--       [,1]  [,2]
-- [1,]  TRUE  TRUE
-- [2,] FALSE FALSE
--
-- , , 2, 2
--
--       [,1]  [,2]
-- [1,]  TRUE  TRUE
-- [2,] FALSE FALSE
Accessing a specific element of an array
-- [1] 8
-- [1] "f"
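A minimal sketch of element access follows. The index positions are chosen here for illustration because they return the same values as the output above; other positions would work the same way. An array needs one index per dimension, in the order row, column, and then any further dimensions.

```r
# Recreate the arrays so that this sketch runs on its own
array1 <- array(data = c(1, 2, 3, 4, 5, 6, 7, 8, 9), dim = c(3, 3, 2))
array2 <- array(data = c("a", "b", "c", "d", "e", "f"), dim = c(2, 3))

# Row 2, column 3, in the first layer of the 3-dimensional array
array1[2, 3, 1]

# Row 2, column 3 of the 2-dimensional array
array2[2, 3]
```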
In R, a list is a versatile data structure that can hold a collection of objects of different types. As mentioned earlier, if objects are like boxes where you store your belongings, then lists can be likened to a store that houses these boxes. Lists provide a flexible way to organize and manage heterogeneous data, allowing you to store vectors, matrices, arrays, data frames, and even other lists within a single object.
# We are going to create a list of the arrays we created earlier
list_first <- list(array1, array2, array3)
# Seeing what the output looks like
list_first
-- [[1]]
-- , , 1
--
--      [,1] [,2] [,3]
-- [1,]    1    4    7
-- [2,]    2    5    8
-- [3,]    3    6    9
--
-- , , 2
--
--      [,1] [,2] [,3]
-- [1,]    1    4    7
-- [2,]    2    5    8
-- [3,]    3    6    9
--
--
-- [[2]]
--      [,1] [,2] [,3]
-- [1,] "a"  "c"  "e"
-- [2,] "b"  "d"  "f"
--
-- [[3]]
-- , , 1, 1
--
--       [,1]  [,2]
-- [1,]  TRUE  TRUE
-- [2,] FALSE FALSE
--
-- , , 2, 1
--
--       [,1]  [,2]
-- [1,]  TRUE  TRUE
-- [2,] FALSE FALSE
--
-- , , 1, 2
--
--       [,1]  [,2]
-- [1,]  TRUE  TRUE
-- [2,] FALSE FALSE
--
-- , , 2, 2
--
--       [,1]  [,2]
-- [1,]  TRUE  TRUE
-- [2,] FALSE FALSE
From the list created above, we can observe a hierarchical structure. Each element in the list encapsulates a distinct object, be it an array, vector, matrix, or any other data structure. The double brackets [[]] are used to access the individual objects contained within the list, allowing for selective retrieval and manipulation of the stored objects. This hierarchical organization enables the efficient storage and management of diverse data types within a single entity, facilitating seamless data handling and analysis in R.
Let us look at how to access an object within a list. Double square brackets with a position number return the object stored at that position:
list_first[[1]]
-- , , 1
--
--      [,1] [,2] [,3]
-- [1,]    1    4    7
-- [2,]    2    5    8
-- [3,]    3    6    9
--
-- , , 2
--
--      [,1] [,2] [,3]
-- [1,]    1    4    7
-- [2,]    2    5    8
-- [3,]    3    6    9
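List elements can also be given names, which often makes access easier to read than position numbers. The sketch below uses a hypothetical staff record (all names and values are assumptions for illustration) and shows the difference between double and single square brackets.

```r
# A named list describing one (hypothetical) member of staff
staff_record <- list(name = "A. Example", grades = c(3, 5, 4), active = TRUE)

# Double brackets or $ return the stored object itself
staff_record[["grades"]]
staff_record$name

# Single brackets return a smaller list that still wraps the object
class(staff_record["grades"])
class(staff_record[["grades"]])

# Indexing can be chained: second value of the grades vector
staff_record[["grades"]][2]
```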
Adding an object to a list:
# let us create a vector
b <- c(1, 2, 3, 4, 5, 6, 7, 8)
# adding the vector to the list
list_first[[4]] <- b
# looking at the list
list_first
-- [[1]]
-- , , 1
--
--      [,1] [,2] [,3]
-- [1,]    1    4    7
-- [2,]    2    5    8
-- [3,]    3    6    9
--
-- , , 2
--
--      [,1] [,2] [,3]
-- [1,]    1    4    7
-- [2,]    2    5    8
-- [3,]    3    6    9
--
--
-- [[2]]
--      [,1] [,2] [,3]
-- [1,] "a"  "c"  "e"
-- [2,] "b"  "d"  "f"
--
-- [[3]]
-- , , 1, 1
--
--       [,1]  [,2]
-- [1,]  TRUE  TRUE
-- [2,] FALSE FALSE
--
-- , , 2, 1
--
--       [,1]  [,2]
-- [1,]  TRUE  TRUE
-- [2,] FALSE FALSE
--
-- , , 1, 2
--
--       [,1]  [,2]
-- [1,]  TRUE  TRUE
-- [2,] FALSE FALSE
--
-- , , 2, 2
--
--       [,1]  [,2]
-- [1,]  TRUE  TRUE
-- [2,] FALSE FALSE
--
--
-- [[4]]
-- [1] 1 2 3 4 5 6 7 8
As we can see above, a list can store many different kinds of objects.
We can also combine two lists into one.
# let us create another list
## first creating some vectors
sky <- c("sun", "star", "moon")
numbers <- c(1, 2, 3, 4, 5)
planets <- c("earth", "mars", "venus")
list_second <- list(sky, numbers, planets)
# Combining the two lists with c() gives one flat list of seven elements;
# list(list_first, list_second) would instead nest them as a list of two lists
list_combine <- c(list_first, list_second)
list_combine
-- [[1]]
-- , , 1
--
--      [,1] [,2] [,3]
-- [1,]    1    4    7
-- [2,]    2    5    8
-- [3,]    3    6    9
--
-- , , 2
--
--      [,1] [,2] [,3]
-- [1,]    1    4    7
-- [2,]    2    5    8
-- [3,]    3    6    9
--
--
-- [[2]]
--      [,1] [,2] [,3]
-- [1,] "a"  "c"  "e"
-- [2,] "b"  "d"  "f"
--
-- [[3]]
-- , , 1, 1
--
--       [,1]  [,2]
-- [1,]  TRUE  TRUE
-- [2,] FALSE FALSE
--
-- , , 2, 1
--
--       [,1]  [,2]
-- [1,]  TRUE  TRUE
-- [2,] FALSE FALSE
--
-- , , 1, 2
--
--       [,1]  [,2]
-- [1,]  TRUE  TRUE
-- [2,] FALSE FALSE
--
-- , , 2, 2
--
--       [,1]  [,2]
-- [1,]  TRUE  TRUE
-- [2,] FALSE FALSE
--
--
-- [[4]]
-- [1] 1 2 3 4 5 6 7 8
--
-- [[5]]
-- [1] "sun"  "star" "moon"
--
-- [[6]]
-- [1] 1 2 3 4 5
--
-- [[7]]
-- [1] "earth" "mars"  "venus"
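The output above is a single flat list of seven elements, because c() joins the elements of the lists it is given. A short sketch with two small illustrative lists (the names first, second, flat, and nested are assumptions) contrasts this with list(), which keeps each list intact as a nested element.

```r
first <- list(1, "a")
second <- list(TRUE, c(2, 3))

# c() concatenates the elements into one flat list
flat <- c(first, second)
length(flat)

# list() wraps the two lists, giving a list of lists
nested <- list(first, second)
length(nested)

# Reaching into a nested list takes one [[ ]] per level
nested[[2]][[2]]
```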
Data frames are fundamental objects in R, and you’ll find yourself working with them extensively. In R, many datasets, including those you import, are stored as data frames. You can think of data frames as similar to your Excel spreadsheets or CSV files but within R. They provide a structured way to organize and analyze data, allowing you to perform a wide range of analyses and visualizations effortlessly. As an example, let’s consider the “iris” dataset, one of the many built-in data frames in R.
head(iris)
--   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
-- 1          5.1         3.5          1.4         0.2  setosa
-- 2          4.9         3.0          1.4         0.2  setosa
-- 3          4.7         3.2          1.3         0.2  setosa
-- 4          4.6         3.1          1.5         0.2  setosa
-- 5          5.0         3.6          1.4         0.2  setosa
-- 6          5.4         3.9          1.7         0.4  setosa
The “iris” dataset contains 150 rows and 5 columns, which would occupy considerable space in our training material. To keep it concise, I’ve used the head() function to display only the first 6 rows of the dataset. To examine the entire dataset, a better approach is the View() function (note the capital V), which opens the whole dataset in a separate tab. Let’s try it out below.
View(iris)
I will stop here for now regarding data frames as we will come back to them and deal with them more extensively and carry out more interesting tasks such as visualizations.
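Before moving on, here is a minimal sketch of building a small data frame by hand with data.frame(). The staff names, departments, and salaries are invented purely for illustration and do not come from any council dataset.

```r
# Each argument becomes a column; all columns must have the same length
staff <- data.frame(
  name = c("Ama", "Kofi", "Esi"),
  department = c("Finance", "Planning", "Finance"),
  salary = c(3200, 2900, 3500)
)

str(staff)          # structure: column names and types
nrow(staff)         # number of rows
ncol(staff)         # number of columns

# The $ operator extracts one column as a vector
mean(staff$salary)

# Rows can be filtered with a logical condition
finance_staff <- staff[staff$department == "Finance", ]
finance_staff
```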
In wrapping up this introduction, let’s delve into some more key concepts pivotal for building a strong foundation in R. We’ll explore arithmetic operations, functions, R packages, and R projects.
Operators in R are essential elements used to perform various operations on data, variables, or expressions. They serve as the building blocks for mathematical calculations, logical evaluations, comparison tasks, and more.
Let’s have a look at some of the operators available in R:
Arithmetic Operators:
Addition (+): Used to add two values together.
Subtraction (-): Used to subtract one value from another.
Multiplication (*): Used to multiply two values.
Division (/): Used to divide one value by another.
Exponentiation (^ or **): Raises one value to the power of another.
Integer Division (%/%): Divides one value by another and returns the integer portion of the result.
Modulus (%%): Computes the remainder of dividing one value by another.
Logical Operators:
AND (&): Returns TRUE if both conditions are TRUE.
OR (|): Returns TRUE if at least one condition is TRUE.
NOT (!): Negates a logical value, returning TRUE if the original value is FALSE and vice versa.
Comparison Operators:
Equal to (==): Checks if two values are equal.
Not equal to (!=): Checks if two values are not equal.
Less than (<): Checks if one value is less than another.
Greater than (>): Checks if one value is greater than another.
Less than or equal to (<=): Checks if one value is less than or equal to another.
Greater than or equal to (>=): Checks if one value is greater than or equal to another.
Assignment Operators:
Leftward assignment (<-): Assigns a value to a variable.
Rightward assignment (->): Assigns a value to a variable in the opposite direction.
Special Operators:
Matrix Multiplication (%*%): Performs matrix multiplication.
Element Matching (%in%): Checks if elements of one vector are present in another.
Sequence Generation (:): Generates a sequence of values from start to end.
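The operators listed above can be tried directly in the console. The following sketch uses arbitrary illustrative values and stores each result under a descriptive name (all names are assumptions) so that it can be inspected afterwards.

```r
# Arithmetic operators
quotient <- 7 / 2       # division
power <- 7 ^ 2          # exponentiation
int_div <- 7 %/% 2      # integer division
remainder <- 7 %% 2     # modulus

# Comparison and logical operators return TRUE or FALSE
both_true <- (5 > 3) & (2 > 4)
either_true <- (5 > 3) | (2 > 4)
negated <- !(5 > 3)

# Special operators
is_listed <- "Finance" %in% c("Finance", "Planning")
one_to_five <- 1:5
m <- matrix(1:4, nrow = 2)
m_squared <- m %*% m

# Rightward assignment stores the value on the left under the name on the right
10 -> ten

print(c(quotient, power, int_div, remainder))
print(c(both_true, either_true, negated, is_listed))
print(m_squared)
```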
At the heart of R lie functions. A function essentially represents a set of instructions that tells R how to perform a specific task. R boasts an extensive library of functions, thousands in fact. Some are integrated into R’s core program (the software you downloaded and installed), while others are added on via “contributed packages” (a brief explanation on packages next) sourced from CRAN (Comprehensive R Archive Network) or other repositories.
In concept, functions in R resemble the built-in functions familiar to users of spreadsheet software like Excel. For instance, Excel’s sum and count functions are akin to their counterparts in R. Functions serve as the backbone of virtually every operation we carry out in R.
Throughout this training, we’ve already encountered several functions, including head(), View(), list(), array(), and matrix(), each serving a different purpose. As you progress in your R journey, you’ll eventually write your own functions tailored to specific tasks.
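As a first taste of writing your own function, here is a minimal sketch. The function name net_pay, its arguments, and the flat 10% default tax rate are all hypothetical, chosen only to echo the payroll example mentioned earlier; a real payroll calculation would be more involved.

```r
# A function takes inputs (arguments) and returns the value of its last expression
net_pay <- function(gross, tax_rate = 0.1) {
  gross - gross * tax_rate
}

net_pay(1000)              # uses the default tax_rate
net_pay(1000, 0.25)        # overrides the default
net_pay(c(1000, 2000))     # works on a whole vector at once
```

Because most R operations are vectorised, the same function works on a single salary or on an entire column of salaries without any extra code.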
To understand what a function does and how to use it, type ? followed by the function’s name (or call help() with the name). This opens the function’s documentation, including usage examples, in the help pane. For example, to read the documentation for head():
?head
Functions are housed within what are known as packages. Think of a package as a specialized toolkit, comprising various functions, data, and code tailored to tackle specific tasks in R. While it’s entirely possible to create your own functions independent of any package, packaging your functions into a cohesive unit offers advantages, particularly in terms of sharing and collaboration.
Despite the plethora of functions available within R’s core program, an expansive ecosystem of packages has emerged, boasting over 19,000 options and counting. These packages extend R’s capabilities, catering to diverse needs across different domains.
One standout package worth mentioning is the Tidyverse. Renowned for its versatility and utility, Tidyverse bundles together essential packages designed for data manipulation, wrangling, analysis, and visualization all under one roof.
To harness the power of a package, the first step is installation. This is a one-time process accomplished using the install.packages() function. Once installed, you can access the package’s functionality by loading it into your R session using the library() function. Note that installation is needed only once, while loading is required each time you start a new R session, for example after restarting RStudio.
Here’s a quick example demonstrating how to install and load the Tidyverse package:
# installing the package
install.packages("tidyverse")
# loading the package
library(tidyverse)
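Because installation is only needed once, some scripts check whether a package is already present before installing it. The sketch below is one possible pattern, not the only one; the helper name install_if_missing is hypothetical. It is demonstrated on "stats", a package that ships with R, so running it does not download anything; replacing "stats" with "tidyverse" would apply the same check to the tidyverse.

```r
# Install a package only when it is not already available
install_if_missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Report (invisibly) whether the package is now available
  invisible(requireNamespace(pkg, quietly = TRUE))
}

install_if_missing("stats")
library(stats)
```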
R projects serve as a structured framework for organizing your work. Picture yourself juggling multiple projects simultaneously, each with its own datasets, files, and scripts. It is similar to managing different sets of documents, each belonging in its own designated folder.
R projects provide an elegant solution to this organizational challenge. They offer a dedicated workspace where you can neatly organize all the components of your project in one place. This includes datasets, scripts, documentation, and any other related files.
By creating separate projects for each endeavor, you can maintain a clear separation of tasks and resources, preventing clutter and confusion. This structured approach not only enhances organization but also promotes reproducibility and collaboration.
Whether you’re working on a single project or juggling multiple endeavors, harnessing the power of R projects ensures a streamlined and efficient workflow, empowering you to focus on your data analysis tasks with clarity and precision.
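One practical consequence of working inside a project is that, when the project is opened in RStudio, the working directory is set to the project folder, so files can be referred to with short relative paths. The sketch below assumes a hypothetical subfolder called data containing a hypothetical file payroll.csv; neither is part of this training's materials.

```r
# The working directory is where R looks for files by default
getwd()

# file.path() builds a path from its parts, relative to the working directory
data_path <- file.path("data", "payroll.csv")
data_path

# Check that the file is there before trying to read it
file.exists(data_path)
```

Relative paths like this keep working when the whole project folder is moved or shared with a colleague, which is part of what makes projects helpful for collaboration.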
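Creating an R project is straightforward: in RStudio, choose File > New Project, select New Directory and then New Project, give the project a name and a location, and click Create Project.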
Now, let’s bring together everything we’ve learned in this introductory section by creating a visualization. We’ll leverage our understanding of R’s functions, object types, and packages to craft a compelling visualization.
library(tidyverse)
-- ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
-- ✔ dplyr     1.1.4     ✔ readr     2.1.5
-- ✔ forcats   1.0.0     ✔ stringr   1.5.1
-- ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
-- ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
-- ✔ purrr     1.0.2
-- ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
-- ✖ dplyr::filter() masks stats::filter()
-- ✖ dplyr::lag()    masks stats::lag()
-- ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
summary(iris)
--   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
--  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
--  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
--  Median :5.800   Median :3.000   Median :4.350   Median :1.300
--  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
--  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
--  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
--        Species
--  setosa    :50
--  versicolor:50
--  virginica :50
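The summary above pools all three species together. As a complement, a short base R sketch (one of several possible approaches) computes the average sepal length separately for each species using tapply(), which applies a function to groups of values. The object name mean_by_species is illustrative.

```r
# Average sepal length within each species of the built-in iris dataset
mean_by_species <- tapply(iris$Sepal.Length, iris$Species, mean)
mean_by_species
```

The three group means differ noticeably from one another, which is a good reason to distinguish the species when plotting the data.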
Now, let’s create a simple scatter plot to visualize the relationship between the sepal length and sepal width of the iris flowers:
# Create a scatter plot
ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point() +
  labs(title = "Sepal Length vs. Sepal Width",
       x = "Sepal Length",
       y = "Sepal Width")
ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)):
ggplot() initializes a new ggplot object, which is a blank canvas for creating visualizations.
data = iris specifies the dataset we’re working with, in this case, the built-in iris dataset.
aes() stands for aesthetic mapping. Here, we’re specifying how variables from the dataset should be mapped to visual properties of the plot.
x = Sepal.Length specifies that the Sepal.Length variable from the dataset should be mapped to the x-axis.
y = Sepal.Width specifies that the Sepal.Width variable from the dataset should be mapped to the y-axis.
color = Species specifies that the Species variable from the dataset should be used to color the points on the plot, creating distinct groups based on species.
geom_point():
geom_point() adds a layer to the plot, representing the data points as individual points (in this case, a scatter plot).
labs(title = "Sepal Length vs. Sepal Width", x = "Sepal Length", y = "Sepal Width"):
labs() allows us to set the labels for the plot’s title, x-axis, and y-axis.
title = "Sepal Length vs. Sepal Width" sets the title of the plot.
x = "Sepal Length" sets the label for the x-axis.
y = "Sepal Width" sets the label for the y-axis.
In the upcoming class, we will delve deeper into the capabilities of ggplot2, a versatile package for creating visualizations in R. We’ll explore its features and learn how to craft clear, impactful plots for data analysis and presentation.
Additionally, we will introduce another indispensable package called dplyr, which is renowned for its prowess in data manipulation tasks. You’ll discover how dplyr simplifies common data wrangling tasks and empowers you to work with data more efficiently.
Furthermore, we will delve into the functionality of various operators in R, gaining a deeper understanding of how they work and how to leverage them effectively in your data analysis workflows.
But that’s not all: we will carry out these explorations in a dynamic document format known as R Markdown. This versatile format allows for seamless integration of code, visualizations, and explanatory text, making it an invaluable tool for reproducible research and collaborative projects.